After completing DataCamp's online course, I think all programming languages are essentially the same. R is very similar to Python, and some functions closely resemble those of the pandas library, since both use a data frame object to store information, so I found it easy to understand. R is focused on data processing, whereas Python is more of a general-purpose language.
My GitHub repository is here.
My course diary web page is here.
something <- value
# some_comment (just like python)
Cmd/Ctrl + Enter # run the current line or selection
+ # Addition
- # Subtraction
* # Multiplication
/ # Division
^ (or **) # Exponentiation
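A quick sketch of these operators in action, with hypothetical values a and b:

```r
a <- 7
b <- 2
a + b  # 9
a - b  # 5
a * b  # 14
a / b  # 3.5
a ^ b  # 49 (same as a ** b)
```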
some_function(object)
mean(data_source)
head(students2014, n = a)
# it's good practice to name the argument to avoid confusion
str(data_source)
# (in Python, by contrast, str() returns the string representation of an object)
"Strongly disagree" = 1
"Strongly agree" = 5
# Boolean values: TRUE or FALSE
# all upper case
c(2, 3, 4.1, 5) # elements are coerced to a common data type (here: numeric)
summary(data_source)
# it will return some important statistical value of the selected data
plot(x, y, type = "p", main = "some title")
names[1]
# start at 1 not 0
names[1] <- "new_value"
names[c(1, 3)] # select value from these positions, order matters
c(1, 2, 3, 4, 5)
1:5
# manipulate the vector
(1:5)*2
# length of the data
length(data)
# slicing
data[begin:end]
# values can also be modified through a slice, e.g. when slicing a dataframe
== # exactly equal to
!= # not equal to
< # less than
> # greater than
<= # less or equal to
>= # greater or equal to
!a # NOT a
a & b # a AND b
a | b # a OR b
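A minimal sketch combining the comparison and logical operators on a hypothetical vector:

```r
x <- c(1, 5, 8, 3)
x > 3              # FALSE TRUE TRUE FALSE
x > 3 & x != 8     # FALSE TRUE FALSE FALSE
x[x > 3 | x == 1]  # 1 5 8 (logical vectors can be used to select elements)
```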
subset(data_source, conditions...)
for (counter in vector) {
commands
more commands
}
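A concrete loop following the template above, using a hypothetical vector:

```r
doubled <- c()
for (counter in c(1, 2, 3)) {
  # the commands run once per element of the vector
  doubled <- c(doubled, counter * 2)
}
doubled  # 2 4 6
```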
library(package_name) # access a library
function_name <- function(arg1, arg2 = 5) return(return_value) # arg = default value
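A usage sketch of the template above, with a hypothetical function add_n:

```r
# arg2 ("n" here) gets a default value, so it can be omitted in the call
add_n <- function(x, n = 5) return(x + n)
add_n(10)     # 15 (n takes its default value 5)
add_n(10, 2)  # 12
```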
variable <- read.table("source_of_data", sep="\t", header=TRUE) # read data
dim(data) # check dimension
str(data) # check structure
colSums(df) # returns a sum of each column in df
rowSums(df) # returns a sum of each row in df
colMeans(df) # returns the mean of each column in df
rowMeans(df) # return the mean of each row in df
select(dataframe, one_of(columns)) # select columns to create a new dataframe
colnames(data)[column_index] <- "new_name" # modify column names
filtered_data <- filter(original_data, conditions) # filter the data with conditions
# If-else statement
if(condition) {
do something
} else {
do something else
}
c(1,2,3,4,5) / 2 # scaling vectors
%>% # pipe operator
select(data_source, one_of(vector_of_columns)) # select a few columns as a new dataframe
colnames(data) # print column names
data_frame <- as.data.frame(scaled) # change data to data frame
sample() # choose random data
mutate() # adding new variables as mutations of the existing ones
joined_data <- inner_join(first_data, second_data, by = common_columns, suffix = c(".first", ".second"))
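A minimal sketch of inner_join() on two hypothetical data frames (requires the dplyr package): only rows whose key appears in both frames are kept, and clashing column names get the given suffixes.

```r
library(dplyr)

first_data  <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
second_data <- data.frame(id = c(2, 3, 4), score = c(5, 6, 7))

joined_data <- inner_join(first_data, second_data,
                          by = "id", suffix = c(".first", ".second"))
joined_data
#   id score.first score.second
# 1  2          20            5
# 2  3          30            6
```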
The original data has 60 variables and 183 observations. Most of the questions were answered on a Likert scale from 1 to 5, except for a few background-related variables (age, attitude, points). For the analysis, the variables related to deep, surface, and strategic learning were scaled in the data wrangling R script.
# Read in the data
new_lrn14 <- as.data.frame(read.csv('data/learning2014.csv'))
A matrix of plots of the variables in the data can be drawn as follows, coloured according to the gender variable:
p <- ggpairs(new_lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
summary(new_lrn14)
Results:
gender age attitude deep stra surf points
F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250 Min. :1.583 Min. : 7.00
M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
Median :22.00 Median :32.00 Median :3.667 Median :3.188 Median :2.833 Median :23.00
Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121 Mean :2.787 Mean :22.72
3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000 Max. :4.333 Max. :33.00
The gender variable shows that more female than male students attended this course. Most of the variables show only a weak correlation with points; the correlation between points and attitude is the strongest of all. Most of the variables appear close to normally distributed, except gender and age.
lrn2014_model <- lm(points ~ attitude + stra + surf, data = new_lrn14)
summary(lrn2014_model)
Call:
lm(formula = points ~ attitude + stra + surf, data = new_lrn14)
Residuals:
Min 1Q Median 3Q Max
-17.1550 -3.4346 0.5156 3.6401 10.8952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.01711 3.68375 2.991 0.00322 **
attitude 0.33952 0.05741 5.913 1.93e-08 ***
stra 0.85313 0.54159 1.575 0.11716
surf -0.58607 0.80138 -0.731 0.46563
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.296 on 162 degrees of freedom
Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
As can be seen from the model summary, the estimates for strategic and surface learning have large p-values and thus no statistical significance in explaining the course points, which matches their weak correlation with the points variable. The model is therefore simplified by removing the strategic learning variable.
Remove strategic learning variable:
lrn2014_model <- lm(points ~ attitude + surf, data = new_lrn14)
summary(lrn2014_model)
Call:
lm(formula = points ~ attitude + surf, data = new_lrn14)
Residuals:
Min 1Q Median 3Q Max
-17.277 -3.236 0.386 3.977 10.642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.11957 3.12714 4.515 1.21e-05 ***
attitude 0.34264 0.05764 5.944 1.63e-08 ***
surf -0.77895 0.79556 -0.979 0.329
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.32 on 163 degrees of freedom
Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
With the strategic learning variable removed, the p-value of the surface learning estimate improves somewhat, although it is still not significant at the 5 % level. The model is not very good: the multiple R² is only 0.1953, meaning that about 80 % of the variation in the dependent variable remains unexplained by the explanatory variables. Any predictions based on the model might therefore not be very reliable.
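The R² figures quoted above can be read straight off the summary object; a sketch with a hypothetical toy model (names toy, toy_model, s are illustrative):

```r
# Toy data, just to show where R-squared lives in the summary object:
toy <- data.frame(x = 1:10, y = c(2, 4, 5, 4, 5, 7, 8, 9, 9, 11))
toy_model <- lm(y ~ x, data = toy)
s <- summary(toy_model)
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared (always a bit smaller)
```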
A linear model assumes that the errors are normally distributed with zero mean and constant variance, and that they are uncorrelated with the explanatory variables.
The validity of these assumptions can be tested by analysing the residuals of the model. In the following figure, three diagnostic plots are drawn:
par(mfrow = c(1,3))
plot(lrn2014_model, which = c(1,2,5))
Interpretations:
alc <- as.data.frame(read.table('data/alc.csv', sep="\t", header=TRUE))
glimpse(alc)
Results:
Observations: 382
Variables: 35
$ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
$ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M, M, M, M, M, F, F, M, M, M, M,…
$ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 17, 16, 15, 15, 1…
$ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, R, U, U, U, U, U,…
$ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3, LE3, GT3, GT3, GT3, GT3, GT3,…
$ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T, T, T, T, T, T, T, T, T, A, T,…
$ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4, 4, 4, 4, 2, 2, 2, 2, 4, 3, 4,…
$ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3, 3, 4, 2, 2, 4, 2, 2, 2, 4, 4,…
$ Mjob <fct> at_home, at_home, at_home, health, other, services, other, other, services, other, teache…
$ Fjob <fct> teacher, other, other, services, other, other, other, teacher, other, other, health, othe…
$ reason <fct> course, course, other, home, home, reputation, home, home, home, home, reputation, reputa…
$ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
$ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, yes…
$ guardian <fct> mother, father, mother, mother, father, mother, mother, mother, mother, mother, mother, f…
$ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,…
$ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 2, 3, 1, 1, 1, 2, 2,…
$ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, no, no, no, yes, no, no, no, n…
$ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes, yes, no, no, yes, no, no, yes…
$ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no, no, no, yes, yes, yes, yes, …
$ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
$ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
$ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3, 4, 5, 4, 5, 4, 1, 4, 2, 5, 4,…
$ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1, 4, 4, 5, 4, 3, 2, 2, 2, 3, 4,…
$ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3, 1, 2, 1, 4, 2, 2, 2, 4, 3, 5,…
$ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 5,…
$ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3, 1, 1, 3, 4, 1, 3, 2, 4, 1, 5,…
$ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5, 1, 5, 5, 5, 5, 5, 5, 1, 5, 5,…
$ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5, 0, 0, 1, 1, 2, 10, 5, 2, 3, 1…
$ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16, 13, 10, 7, 10, 12, 12, 14, 12…
$ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16, 14, 12, 6, 11, 14, 14, 14, 1…
$ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 16, 14, 12, 6, 11, 14, 14, 15, …
$ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0, 2.0, 1.5, 1.0, 1.5, 1.5, 1.0,…
$ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
I’ve chosen the following four variables that I assume are indicative of students’ alcohol consumption: final grade (G3), failures, absences, and study time.
A summary and plots of the chosen variables are shown below. The box plots are also grouped by sex to reveal possible differences.
summary(alc[c('G3','failures','absences','studytime')])
G3 failures absences studytime
Min. : 0.00 Min. :0.0000 Min. : 0.0 Min. :1.000
1st Qu.:10.00 1st Qu.:0.0000 1st Qu.: 1.0 1st Qu.:1.000
Median :12.00 Median :0.0000 Median : 3.0 Median :2.000
Mean :11.46 Mean :0.2016 Mean : 4.5 Mean :2.037
3rd Qu.:14.00 3rd Qu.:0.0000 3rd Qu.: 6.0 3rd Qu.:2.000
Max. :18.00 Max. :3.0000 Max. :45.0 Max. :4.000
# plot
p1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + xlab('high use') + ylab('final grade')
p2 <- ggplot(alc, aes(failures)) + geom_bar(aes(fill = high_use), position = "dodge", stat = "count") + xlab('failures')
p3 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + xlab('high use')
p4 <- ggplot(alc, aes(studytime)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count")
m <- glm(high_use ~ G3 + failures + absences + studytime, data = alc, family = "binomial")
summary(m)
Call:
glm(formula = high_use ~ G3 + failures + absences + studytime,
family = "binomial", data = alc)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1570 -0.8393 -0.6605 1.1103 2.1227
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.02933 0.56149 0.052 0.958346
G3 -0.03519 0.03868 -0.910 0.362964
failures 0.29443 0.20422 1.442 0.149383
absences 0.07766 0.02271 3.420 0.000626 ***
studytime -0.47423 0.15920 -2.979 0.002893 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 465.68 on 381 degrees of freedom
Residual deviance: 429.15 on 377 degrees of freedom
AIC: 439.15
Number of Fisher Scoring iterations: 4
According to the model summary, my hypothesis was wrong with regard to the final grade and failures: neither is a statistically significant predictor. The other two explanatory variables are relatively strong predictors. Odds ratios were calculated:
or <- exp(coef(m))
or
(Intercept) G3 failures absences studytime
1.0297605 0.9654265 1.3423577 1.0807536 0.6223632
As shown by the odds ratios, the final grade barely affects the odds of high alcohol consumption: its odds ratio is close to one. Each additional failure multiplies the odds by about 1.34 and each additional absence by about 1.08, while each step up in study time multiplies them by about 0.62. In other words, a student who fails is more likely to also drink a lot, and a student who studies a lot is almost half as likely to.
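The odds ratios above are just the exponentiated model coefficients, so they can be sanity-checked by hand:

```r
# A coefficient of 0 corresponds to an odds ratio of 1 (no effect);
# positive coefficients give ratios above 1, negative ones below 1.
exp(0)       # 1
exp(0.294)   # ~1.34, like the failures coefficient above
exp(-0.474)  # ~0.62, like the studytime coefficient above
```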
Confidence intervals:
ci <- exp(confint(m))
cbind(or, ci)
or 2.5 % 97.5 %
(Intercept) 1.0297605 0.3412502 3.1040114
G3 0.9654265 0.8948227 1.0417846
failures 1.3423577 0.9000830 2.0158575
absences 1.0807536 1.0359926 1.1326415
studytime 0.6223632 0.4513693 0.8437505
With regard to the final grade, my hypothesis was wrong: the value one lies almost in the middle of its confidence interval. Thus, before making any actual predictions with the model, it’s best to remove that variable:
m <- glm(high_use ~ failures + absences + studytime, data = alc, family = "binomial")
summary(m)
Call:
glm(formula = high_use ~ failures + absences + studytime, family = "binomial",
data = alc)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1536 -0.8385 -0.6825 1.1469 2.1604
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.36590 0.35603 -1.028 0.304080
failures 0.36378 0.19023 1.912 0.055845 .
absences 0.07897 0.02271 3.477 0.000507 ***
studytime -0.48619 0.15855 -3.066 0.002166 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 465.68 on 381 degrees of freedom
Residual deviance: 429.97 on 378 degrees of freedom
AIC: 437.97
Number of Fisher Scoring iterations: 4
cbind(exp(coef(m)), exp(confint(m)))
2.5 % 97.5 %
(Intercept) 0.6935745 0.3442726 1.3944631
failures 1.4387512 0.9917543 2.1021998
absences 1.0821747 1.0373417 1.1341316
studytime 0.6149678 0.4465274 0.8325472
# Predict the probability.
prob <- predict(m, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability = prob)
# Calculate a logical high use value based on probabilites.
alc <- mutate(alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions,
# with both absolute and proportional numbers.
tbl <- table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(tbl)
prediction
high_use FALSE TRUE Sum
FALSE 254 14 268
TRUE 91 23 114
Sum 345 37 382
round(addmargins(prop.table(tbl)), 2)
prediction
high_use FALSE TRUE Sum
FALSE 0.66 0.04 0.70
TRUE 0.24 0.06 0.30
Sum 0.90 0.10 1.00
high_u <- as.data.frame(prop.table(table(alc$high_use)))
predic <- as.data.frame(prop.table(table(alc$prediction)))
pp1 <- ggplot(high_u, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('observed high use') + theme(legend.position = 'none')
pp2 <- ggplot(predic, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('predicted high use') + theme(legend.position = 'none')
multiplot(pp1, pp2, cols = 2)
mloss <- function(obs, prob) {
res <- ifelse(prob > 0.5, 1, 0)
mean(res != obs)
}
round(mloss(obs = alc$high_use, prob = alc$probability), 2)
[1] 0.27
The training error is 27 %, so the accuracy of this model is 73 %.
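To sanity-check the loss function, it can be run on hypothetical vectors where the answer is obvious:

```r
mloss <- function(obs, prob) {
  # turn the probabilities into 0/1 predictions at the 0.5 threshold
  res <- ifelse(prob > 0.5, 1, 0)
  # proportion of predictions that differ from the observations
  mean(res != obs)
}
# Three of the four predictions match the observations, so the error is 0.25:
mloss(obs = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.3, 0.1))  # 0.25
```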
# access the MASS package
library(MASS)
# load the data
data("Boston")
glimpse(Boston)
Observations: 506
Variables: 14
$ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004, 0.…
$ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 0.0, 0.0, 0.0, 0.0,…
$ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 8.14, 8.14, 8.…
$ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524, 0.524, 0.524, 0…
$ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004, 6.377, 6.009, 5.889, 5…
$ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 39.0, 61.8, 84.5, 5…
$ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921, 6.3467, 6.22…
$ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311, 307, 307, 307, 307, 307, 30…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 21.0, 21.0, 21…
$ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386.63, 386.71, 392.52, 396.…
$ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10, 20.45, 13.27, 15.71, 8.26, 1…
$ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19…
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
p
summary(Boston)
crim zn indus chas nox rm
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850 Min. :3.561
1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380 Median :6.208
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547 Mean :6.285
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710 Max. :8.780
age dis rad tax ptratio black
Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
The dataset must be standardised: each variable is centred by subtracting its mean and divided by its standard deviation, so that every variable has zero mean and unit variance. This can be done as follows:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
crim zn indus chas nox
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
rm age dis rad tax ptratio
Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
black lstat medv
Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
Median : 0.3808 Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
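As the summary confirms, every mean is now zero. A quick check on a hypothetical vector that scale() really yields zero mean and unit standard deviation:

```r
v <- c(2, 4, 6, 8)
scaled <- scale(v)  # subtracts the mean (5), divides by the standard deviation
mean(scaled)        # 0
sd(scaled)          # 1
```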
We also need to categorise our target variable – crim – to classify it:
# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low','med_low','med_high','high'))
# Replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
## low med_low med_high high
## 127 126 126 127
In order to create the LDA model and to test it, the data has to be divided into training and testing sets.
n <- nrow(boston_scaled) # Get number of rows in the dataset.
ind <- sample(n, size = n * 0.8) # Choose randomly 80% of the rows.
train <- boston_scaled[ind,] # Create train set.
test <- boston_scaled[-ind,] # Create test set.
# Save the correct classes from the test data.
correct_classes <- test$crime
# Remove the crime variable from the test data.
test <- dplyr::select(test, -crime)
lda.fit <- lda(crime ~ ., data = train)
lda.fit
Call:
lda(crime ~ ., data = train)
Prior probabilities of groups:
low med_low med_high high
0.2475248 0.2549505 0.2500000 0.2475248
Group means:
zn indus chas nox rm age dis rad tax
low 0.8788254 -0.90299708 -0.193587063 -0.8578620 0.4283921 -0.8810268 0.8563773 -0.6993483 -0.7586291
med_low -0.1000971 -0.32495455 -0.004759149 -0.5695319 -0.1409065 -0.3721597 0.3648300 -0.5470148 -0.4791331
med_high -0.3751653 0.09765591 0.234426408 0.3247534 0.1474674 0.3919174 -0.3256314 -0.3974038 -0.3377904
high -0.4872402 1.01715195 -0.075474056 1.0130937 -0.3741645 0.7948136 -0.8597796 1.6377820 1.5138081
ptratio black lstat medv
low -0.42566133 0.38009442 -0.75226396 0.528774452
med_low -0.09874892 0.31131227 -0.12813291 0.003927945
med_high -0.28221456 0.09962635 -0.03927833 0.188486538
high 0.78037363 -0.77500232 0.89169331 -0.705092197
Coefficients of linear discriminants:
LD1 LD2 LD3
zn 0.15327760 0.72696090 -0.96918821
indus 0.06895212 -0.23203144 0.19852525
chas -0.08674273 -0.16587825 0.09866070
nox 0.29905137 -0.69375411 -1.34777143
rm -0.08073322 -0.13239391 -0.18618294
age 0.35186877 -0.34166351 -0.30960892
dis -0.04116222 -0.25001198 0.05134464
rad 3.22140261 0.98659875 -0.38833796
tax -0.01928876 -0.13190509 0.89928951
ptratio 0.11143195 0.02470684 -0.30292304
black -0.12568460 0.01433414 0.08414446
lstat 0.21707935 -0.13116253 0.54061068
medv 0.17787842 -0.28268695 -0.07678544
Proportion of trace:
LD1 LD2 LD3
0.9527 0.0337 0.0136
The output shows that we have three linear discriminants. The first explains the vast majority (95.27 %) of the between-group variance.
# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.
lda.pred <- predict(lda.fit, newdata = test) # Predict the test values.
# Cross tabulate the predictions with the correct values.
table(correct = correct_classes, predicted = lda.pred$class)
predicted
correct low med_low med_high high
low 15 6 3 0
med_low 3 18 4 0
med_high 0 15 15 2
high 0 0 0 21
As seen from the table, the model didn’t predict the “med_high” category well. Thus, the model can be used to make crude predictions, but it’s hardly perfect.
boston_scaled <- as.data.frame(scale(Boston)) # Standardise the data.
dist_eu <- dist(boston_scaled) # Create a Euclidean distance matrix.
summary(dist_eu) # Summarise the matrix.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
We can try to cluster the data with k-means straight away. We used four classes for our LDA model, so we might try the same number of clusters here:
km <- kmeans(dist_eu, centers = 4) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.
However, while the results look somewhat reasonable, the number of clusters was merely a guess. To determine it properly, the total within-cluster sum of squares (TWCSS) should be calculated. Let’s try it with a maximum of 15 clusters:
k_max <- 15 # Maximum number of clusters to try.
# Define a function for testing.
k_try <- function(k) {
kmeans(dist_eu, k)$tot.withinss
}
# Calculate the total within sum of squares using the function.
twcss <- sapply(1:k_max, k_try)
# Visualize the results.
plot(1:k_max, twcss, type='b')
For comparison, re-cluster the data with just two clusters:
km <- kmeans(dist_eu, centers = 2) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.
It can be seen from the plot that there is less overlap than with four clusters, which suggests that the optimal number of clusters is two.
# Read in the data
human <- as.data.frame(read.table('data/human.csv', sep="\t", header=TRUE))
glimpse(human)
Observations: 155
Variables: 8
$ se_f_of_m <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.9690608, 0.9927835, 1.0241730, 1.0031646, 1…
$ lfp_f_of_m <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.8286119, 0.8072289, 0.7797357, 0.8171263, 0…
$ edu_exp <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9, 19.2, 15.4, 15.8, 16.2, 19.0, 16.9,…
$ life_exp <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0, 81.8, 83.0, 82.2, 80.7, 82.6, 81.9,…
$ gni_cap <dbl> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 42155, 32689, 76628, 45636, 39267…
$ mmr <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 11, 6, 6, 12, 4, 4, 7, 4, 4, 5, 5, 11,…
$ abr <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3, 6.0, 6.5, 25.8, 11.5, 2.2, 7.8, 8.3…
$ mp_share <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2, 31.4, 25.3, 43.6, 23.5, 41.3, 16.3,…
From the output, there are 155 observations (countries) and 8 variables; a plot matrix and summary of the variables are shown below:
ggpairs(human)
summary(human)
se_f_of_m lfp_f_of_m edu_exp life_exp gni_cap mmr
Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 Min. : 581 Min. : 1.0
1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198 1st Qu.: 11.5
Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 Median : 12040 Median : 49.0
Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 Mean : 17628 Mean : 149.1
3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512 3rd Qu.: 190.0
Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 Max. :123124 Max. :1100.0
abr mp_share
Min. : 0.60 Min. : 0.00
1st Qu.: 12.65 1st Qu.:12.40
Median : 33.60 Median :19.30
Mean : 47.16 Mean :20.91
3rd Qu.: 71.95 3rd Qu.:27.95
Max. :204.80 Max. :57.50
corrplot(round(cor(human), digits = 2), method = "circle", type = "upper", tl.pos = "d", tl.cex = 0.8)
The plot shows that the strongest correlation of all is the negative correlation between life expectancy (life_exp) and maternal mortality rate (mmr). The next strongest are the positive correlations between life expectancy (life_exp) and expected education (edu_exp), and between maternal mortality rate (mmr) and adolescent birth rate (abr).
pca_human <- prcomp(human)
s_human_nonstd <- summary(pca_human)
s_human_nonstd
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# Percentages of variance for the plot titles.
pr_shns <- round(100*s_human_nonstd$importance[2, ], digits = 1)
pc_shns_lab <- paste0(names(pr_shns), " (", pr_shns, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shns_lab[1], ylab = pc_shns_lab[2])
human_std <- scale(human) # Standardise the variables.
pca_human_std <- prcomp(human_std)
s_human_std <- summary(pca_human_std)
s_human_std
pr_shs <- round(100*s_human_std$importance[2, ], digits = 1)
pc_shs_lab <- paste0(names(pr_shs), " (", pr_shs, "%)")
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shs_lab[1], ylab = pc_shs_lab[2])
## 5.3 Interpreting the PCA Results
The first principal component (PC1) explains 53.6 % of the total variance of the original 8 variables, and the second (PC2) explains 16.2 %.
From the plot, we can conclude that PC1 represents variables related mostly to poverty and health, and PC2 variables related mostly to gender equality.
It’s safe to say that most of the total variance in the data comes from poverty and health, though gender equality also explains some of it.
From the arrows, we can see that high maternal mortality rate (mmr) and adolescent birth rate (abr) correlate strongly with poverty, while high life expectancy (life_exp), expected education (edu_exp), ratio of females with secondary education (se_f_of_m), and GNI per capita (gni_cap) correlate strongly and negatively with it.
Further, a high share of female MPs (mp_share) and high female labour-force participation (lfp_f_of_m) correlate strongly and positively with gender equality.
data("tea")
glimpse(tea)
Observations: 300
Variables: 36
$ breakfast <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, breakfast, Not.breakfast, break…
$ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea time, Not.tea time, Not.tea time, tea…
$ evening <fct> Not.evening, Not.evening, evening, Not.evening, evening, Not.evening, Not.evening, …
$ lunch <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lu…
$ dinner <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, dinner, Not.dinner, Not.dinner,…
$ always <fct> Not.always, Not.always, Not.always, Not.always, always, Not.always, Not.always, Not…
$ home <fct> home, home, home, home, home, home, home, home, home, home, home, home, home, home,…
$ work <fct> Not.work, Not.work, work, Not.work, Not.work, Not.work, Not.work, Not.work, Not.wor…
$ tearoom <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, Not.t…
$ friends <fct> Not.friends, Not.friends, friends, Not.friends, Not.friends, Not.friends, friends, …
$ resto <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, Not.resto, Not.resto, Not.resto,…
$ pub <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, No…
$ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl Grey, Earl Grey, black, Earl Gr…
$ How <fct> alone, milk, alone, alone, alone, alone, alone, milk, milk, alone, alone, alone, mi…
$ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar, No.sugar, No.sugar, No.sugar,…
$ how <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, tea bag+unp…
$ where <fct> chain store, chain store, chain store, chain store, chain store, chain store, chain…
$ price <fct> p_unknown, p_variable, p_variable, p_variable, p_variable, p_private label, p_varia…
$ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 66, 65, 60, 35, 72, 73, 80, 76,…
$ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F, F, F, F, F, F, F, F, F, F, F,…
$ SPC <fct> middle, middle, other worker, student, employee, student, senior, middle, senior, s…
$ Sport <fct> sportsman, sportsman, sportsman, Not.sportsman, sportsman, sportsman, sportsman, sp…
$ age_Q <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-44, 35-44, 35-44, 25-34, 25-34,…
$ frequency <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/week, 1 to 2/week, +2/day, +2/da…
$ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-exoticism, escape-exoticism, esc…
$ spirituality <fct> Not.spirituality, Not.spirituality, Not.spirituality, spirituality, spirituality, N…
$ healthy <fct> healthy, healthy, healthy, healthy, Not.healthy, healthy, healthy, healthy, Not.hea…
$ diuretic <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diuretic, Not.diuretic, Not.diureti…
$ friendliness <fct> Not.friendliness, Not.friendliness, friendliness, Not.friendliness, friendliness, N…
$ iron.absorption <fct> Not.iron absorption, Not.iron absorption, Not.iron absorption, Not.iron absorption,…
$ feminine <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminine, Not.feminine, Not.feminine,…
$ sophisticated <fct> Not.sophisticated, Not.sophisticated, Not.sophisticated, sophisticated, Not.sophist…
$ slimming <fct> No.slimming, No.slimming, No.slimming, No.slimming, No.slimming, No.slimming, No.sl…
$ exciting <fct> No.exciting, exciting, No.exciting, No.exciting, No.exciting, No.exciting, No.excit…
$ relaxing <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxing, relaxing, relaxing, relaxin…
$ effect.on.health <fct> No.effect on health, No.effect on health, No.effect on health, No.effect on health,…
cha <- dplyr::select(tea, one_of(c('Tea','How','sugar','where','healthy','sex')))
glimpse(cha)
Observations: 300
Variables: 6
$ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl Grey, Earl Grey, black, Earl Grey, black…
$ How <fct> alone, milk, alone, alone, alone, alone, alone, milk, milk, alone, alone, alone, milk, milk,…
$ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar, No.sugar, No.sugar, No.sugar, No.sugar…
$ where <fct> chain store, chain store, chain store, chain store, chain store, chain store, chain store, c…
$ healthy <fct> healthy, healthy, healthy, healthy, Not.healthy, healthy, healthy, healthy, Not.healthy, hea…
$ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F, F, F, F, F, F, F, F, F, F, F, F, F, F,…
ggplot(gather(cha), aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
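As a side note, `gather()` has since been superseded by `pivot_longer()` in newer tidyr releases. A hedged sketch of the same bar-plot call with the newer function (assuming the `cha` data frame from above, whose columns are all factors and therefore need `values_transform` to be combined):

```r
library(tidyr)
library(ggplot2)

# Reshape all columns to long format; the factor columns have different
# level sets, so coerce values to character when combining them
cha_long <- pivot_longer(cha, cols = everything(),
                         names_to = "key", values_to = "value",
                         values_transform = as.character)

# Same faceted bar plot as with gather()
ggplot(cha_long, aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```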
cha_mca <- MCA(cha, graph = FALSE)
summary(cha_mca)
Call:
MCA(X = cha, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
Variance 0.234 0.220 0.212 0.193 0.169 0.159 0.139 0.124 0.119 0.096
% of var. 14.063 13.193 12.749 11.603 10.119 9.545 8.356 7.462 7.141 5.769
Cumulative % of var. 14.063 27.256 40.005 51.608 61.727 71.271 79.627 87.090 94.231 100.000
Individuals (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | -0.104 0.015 0.009 | -0.165 0.041 0.023 | 0.325 0.165 0.089 |
2 | 0.577 0.473 0.212 | -0.259 0.102 0.043 | 0.047 0.003 0.001 |
3 | 0.180 0.046 0.052 | -0.255 0.099 0.106 | -0.568 0.507 0.523 |
4 | -0.469 0.313 0.286 | 0.097 0.014 0.012 | -0.019 0.001 0.000 |
5 | -0.388 0.215 0.142 | -0.150 0.034 0.021 | -0.257 0.104 0.062 |
6 | -0.089 0.011 0.011 | -0.261 0.103 0.091 | -0.126 0.025 0.021 |
7 | -0.089 0.011 0.011 | -0.261 0.103 0.091 | -0.126 0.025 0.021 |
8 | 0.577 0.473 0.212 | -0.259 0.102 0.043 | 0.047 0.003 0.001 |
9 | 0.007 0.000 0.000 | 0.486 0.357 0.119 | 0.199 0.062 0.020 |
10 | 0.640 0.583 0.266 | -0.145 0.032 0.014 | 0.403 0.255 0.105 |
Categories (the 10 first)
              Dim.1    ctr   cos2  v.test   Dim.2    ctr   cos2   v.test   Dim.3    ctr   cos2  v.test
black       |  0.826 11.979  0.224   8.177 | -0.320  1.919  0.034   -3.170 |  0.608  7.164  0.121   6.021 |
Earl Grey   | -0.235  2.536  0.100  -5.468 |  0.415  8.395  0.311    9.636 | -0.342  5.893  0.211  -7.936 |
green       | -0.476  1.774  0.028  -2.895 | -1.708 24.331  0.361  -10.385 |  0.634  3.469  0.050   3.855 |
alone       | -0.117  0.638  0.026  -2.768 | -0.357  6.285  0.237   -8.416 | -0.364  6.768  0.247  -8.586 |
lemon       | -0.129  0.130  0.002  -0.782 |  1.152 11.064  0.164    7.003 |  1.112 10.662  0.153   6.758 |
milk        | -0.026  0.010  0.000  -0.235 |  0.369  2.163  0.036    3.286 |  0.386  2.455  0.040   3.442 |
other       |  3.202 21.870  0.317   9.737 |  0.934  1.985  0.027    2.841 |  1.116  2.931  0.039   3.394 |
No.sugar    |  0.534 10.463  0.304   9.541 | -0.486  9.248  0.252   -8.688 | -0.142  0.823  0.022  -2.547 |
sugar       | -0.570 11.185  0.304  -9.541 |  0.519  9.885  0.252    8.688 |  0.152  0.879  0.022   2.547 |
chain store | -0.237  2.562  0.100  -5.470 | -0.203  2.001  0.073   -4.682 | -0.335  5.622  0.199  -7.715 |
Categorical variables (eta2)
Dim.1 Dim.2 Dim.3
Tea | 0.229 0.457 0.211 |
How | 0.318 0.284 0.291 |
sugar | 0.304 0.252 0.022 |
where | 0.248 0.306 0.362 |
healthy | 0.159 0.020 0.028 |
sex | 0.147 0.000 0.362 |
Drawing the variable biplot, the individuals plot, and the categories plot of the analysis:
par(mfrow = c(1,3)) # Set some graphical params.
plot(cha_mca, choix = "var", title = "MCA variables") # The variable biplot.
plot(cha_mca, choix = "ind", invisible = "var") # The individuals plot.
plot(cha_mca, choix = "ind", invisible = "ind") # The categories plot.
As can be seen from the leftmost plot, the strongest link is between the Tea variable (i.e. the type of tea) and the second dimension, which matches the eta2 table above (0.457 on Dim.2).
The plots in the center and on the right show no clear pattern.
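The variable-dimension links read off the eta2 table can also be checked programmatically: FactoMineR's `dimdesc()` lists, for each dimension, the variables and categories most significantly associated with it. A minimal sketch, assuming the `cha_mca` object fitted above:

```r
library(FactoMineR)

# Describe each MCA dimension by its most strongly associated
# variables (R2, p-value) and categories
dims <- dimdesc(cha_mca, axes = 1:3)

dims$`Dim 1`$quali  # variables linked to dimension 1
dims$`Dim 2`$quali  # Tea should rank high here, matching the eta2 table
```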